Classification of HIV-1 Protease Inhibitors by Machine Learning Methods
Li, Y.; Tian, Y.J.; Qin, Z.J.; Yan, A.X.*
ACS Omega, 2018, 3, 15837-15849
HIV-1 protease plays an important role in viral infection and is an effective therapeutic target for the treatment of HIV-1. Our data set consists of 4855 HIV-1 protease inhibitors (PIs) retrieved from ChEMBL. We built 15 classification models to predict inhibitor activity using five machine learning methods: k-nearest neighbors (kNN), decision tree (DT), random forest (RF), support vector machine (SVM), and deep neural network (DNN). The molecular structures were characterized by (1) fingerprint descriptors, namely MACCS and PubChem fingerprints, and (2) physicochemical descriptors calculated with CORINA Symphony. All models achieved a prediction accuracy above 70% on the test set; the best accuracy, 83.07%, was obtained by Model 4A, built with SVM on MACCS fingerprints. Nine consensus models were then built on the three descriptor types, combining all five machine learning methods through "consensus prediction". Among them, Model C3a, based on MACCS fingerprints, showed the highest accuracy on both the training set (91.96%) and the test set (83.15%). An external validation set comprising 35989 compounds from the DUD database and 239 active inhibitors from the recent literature was used to verify model performance. On this set, the best prediction accuracy, 98.37%, was obtained by Model 3C, built with RF on CORINA Symphony descriptors. In addition, analysis of the molecular descriptors shows that aromatic systems and atoms involved in hydrogen bonding make important contributions to the bioactivity of PIs.
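To make the fingerprint-based classification idea concrete, the following is a minimal sketch of a k-nearest-neighbor classifier over binary fingerprints. It is illustrative only: the Tanimoto similarity, the value of `k`, the function names, and the toy fingerprints are all assumptions and are not taken from the paper, which does not state these details here.

```python
from collections import Counter


def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def knn_predict(query, training, k=3):
    """Majority vote over the k training fingerprints most similar to `query`.

    `training` is a list of (fingerprint, label) pairs; label 1 = active, 0 = inactive.
    """
    ranked = sorted(training, key=lambda item: tanimoto(query, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy fingerprints (sets of on-bit positions), not real MACCS keys.
training = [
    ({1, 2, 3, 4}, 1),
    ({1, 2, 3, 5}, 1),
    ({2, 3, 4, 6}, 1),
    ({7, 8, 9}, 0),
    ({7, 8, 10}, 0),
    ({8, 9, 11}, 0),
]

print(knn_predict({1, 2, 4}, training))    # three nearest neighbours are actives
print(knn_predict({7, 9, 10}, training))   # three nearest neighbours are inactives
```

In practice the fingerprints would come from a cheminformatics toolkit and the classifier from an established machine learning library; the sketch only shows how a similarity measure on bit patterns turns into a class prediction.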
Classification model performance (data set: 4855 protease inhibitors)
Model name | Algorithm | Descriptors | Training set accuracy (%) | Training set 5-fold cross-validation accuracy (%) | Test set sensitivity (SE) | Test set specificity (SP) | Test set accuracy (%) | Test set MCC |
---|---|---|---|---|---|---|---|---|
Model 1A | kNN | MACCS | 89.52 | 79.02 | 0.76 | 0.83 | 80.48 | 0.60 |
Model 1B | kNN | PubChem | 89.89 | 80.07 | 0.77 | 0.83 | 80.91 | 0.61 |
Model 1C | kNN | CORINA | 83.02 | 78.19 | 0.78 | 0.78 | 78.01 | 0.54 |
Model 2A | DT | MACCS | 86.00 | 76.79 | 0.76 | 0.77 | 76.59 | 0.51 |
Model 2B | DT | PubChem | 89.52 | 79.57 | 0.77 | 0.79 | 78.01 | 0.54 |
Model 2C | DT | CORINA | 80.89 | 71.22 | 0.64 | 0.75 | 70.49 | 0.40 |
Model 3A | RF | MACCS | 88.10 | 80.62 | 0.83 | 0.79 | 80.35 | 0.59 |
Model 3B | RF | PubChem | 88.41 | 80.84 | 0.83 | 0.78 | 79.92 | 0.59 |
Model 3C | RF | CORINA | 83.09 | 75.68 | 0.74 | 0.74 | 74.04 | 0.46 |
Model 4A | SVM | MACCS | 92.31 | 81.80 | 0.82 | 0.84 | 83.07 | 0.65 |
Model 4B | SVM | PubChem | 90.91 | 83.37 | 0.83 | 0.82 | 82.58 | 0.64 |
Model 4C | SVM | CORINA | 85.94 | 78.59 | 0.80 | 0.80 | 79.56 | 0.58 |
Model 5A | DNN | MACCS | 90.98 | 79.71 | 0.80 | 0.84 | 82.27 | 0.63 |
Model 5B | DNN | PubChem | 90.79 | 82.69 | 0.81 | 0.82 | 81.53 | 0.62 |
Model 5C | DNN | CORINA | 82.65 | 76.27 | 0.76 | 0.79 | 77.82 | 0.54 |
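The SE, SP, accuracy, and MCC columns above can all be derived from the four counts of a binary confusion matrix. The sketch below uses the standard definitions of sensitivity, specificity, accuracy, and the Matthews correlation coefficient; it is assumed, not confirmed here, that the paper uses exactly these formulas. The function name and the example counts are hypothetical and do not correspond to any model in the table.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Return sensitivity, specificity, accuracy, and MCC from confusion-matrix counts."""
    se = tp / (tp + fn)                      # fraction of actives predicted active
    sp = tn / (tn + fp)                      # fraction of inactives predicted inactive
    acc = (tp + tn) / (tp + tn + fp + fn)    # fraction of all compounds predicted correctly
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"SE": se, "SP": sp, "ACC": acc, "MCC": mcc}


# Hypothetical counts for illustration only.
metrics = classification_metrics(tp=80, tn=85, fp=15, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Unlike accuracy, MCC uses all four counts symmetrically, which is why it is a useful companion metric when the active and inactive classes differ in size.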
Consensus models: 4855 protease inhibitors
Model Name | Methods | Descriptors | Training set accuracy (%) | Test set accuracy (%) |
---|---|---|---|---|
Model C3a | C3: at least three of the five individual classification model predictions are correct | MACCS | 91.96 | 83.15 |
Model C3b | C3: at least three of the five individual classification model predictions are correct | PubChem | 91.44 | 82.53 |
Model C3c | C3: at least three of the five individual classification model predictions are correct | CORINA | 85.97 | 79.07 |
Model C4a | C4: at least four of the five individual classification model predictions are correct | MACCS | 86.65 | 74.75 |
Model C4b | C4: at least four of the five individual classification model predictions are correct | PubChem | 87.64 | 77.41 |
Model C4c | C4: at least four of the five individual classification model predictions are correct | CORINA | 78.78 | 69.25 |
Model C5a | C5: all of the five individual classification model predictions are correct | MACCS | 76.21 | 62.90 |
Model C5b | C5: all of the five individual classification model predictions are correct | PubChem | 77.94 | 63.77 |
Model C5c | C5: all of the five individual classification model predictions are correct | CORINA | 63.51 | 52.17 |
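The consensus criteria in the table can be sketched as a simple counting rule. The example below reads C3, C4, and C5 as thresholds of three, four, and five correct individual predictions per compound, which is an interpretation of the table rather than a confirmed description of the authors' procedure. The function name and the toy labels are hypothetical.

```python
def consensus_accuracy(y_true, model_predictions, min_correct):
    """Fraction of compounds for which at least `min_correct` models predict the true label.

    `model_predictions` holds one prediction list per individual model,
    each aligned with `y_true`.
    """
    hits = 0
    for i, truth in enumerate(y_true):
        n_correct = sum(1 for preds in model_predictions if preds[i] == truth)
        if n_correct >= min_correct:
            hits += 1
    return hits / len(y_true)


# Hypothetical labels for six compounds and predictions from five models.
y_true = [1, 1, 0, 0, 1, 0]
model_predictions = [
    [1, 1, 0, 0, 1, 0],
    [1, 0, 0, 0, 1, 1],
    [1, 1, 0, 1, 0, 1],
    [1, 1, 1, 0, 0, 1],
    [1, 0, 0, 0, 1, 1],
]

for name, threshold in (("C3", 3), ("C4", 4), ("C5", 5)):
    print(name, round(consensus_accuracy(y_true, model_predictions, threshold), 3))
```

Because each stricter threshold can only remove compounds from the "correct" set, accuracy under this rule cannot increase from C3 to C4 to C5, which matches the trend in the table.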
Main project members
Ph.D. student
Ph.D. student
1204429112@qq.com